Expressing Program Analyses in Datalog

Michael Martin and Monica S. Lam

Computer Science Department
Stanford University
USA

Editor: Roberto Bagnara



PDF Version

Abstract

Declarative rule-based languages like Datalog have been shown to be suitable for writing program analyses. We show how implementing Datalog using a data structure called Binary Decision Diagrams enables the computation of precise pointer aliases in real life applications. We then demonstrate how pointer analysis enables important program analyses that build upon these results.

Introduction

Pointer alias analysis was a long-standing problem in the literature of program analysis [2]. Two aspects of the problem are particularly troublesome:
  • Many problems in optimization and program understanding require accurate pointer information to work.
  • The result sets of a pointer analysis for a real-life program include astronomical numbers of facts - one survey found on the order of 1040 possible results in the complete result.
Research in the past attempted to compute results as needed. However, an unresolved pointer access can potentially ruin the accuracy of many results, making on-demand analysis ineffective for pointer-alias problems.

There is, however, a way out of this. Points-to information is often highly repetitive, and as a result, it can be dramatically compressed. A data structure called Binary Decision Diagrams, or BDDs, represents repetitive data very well, and can produce an exponential reduction in the amount of space required to represent pointer results. BDD-based pointer results are compact enough that a pointer analysis can work out all pointer information a priori and encode its results for a subsequent analysis to use.

BDDs are not easy to work with directly. We can hide the details of BDD manipulation by allowing analyses to be written in the logic programming language Datalog and developing a compiler that translates Datalog into optimized BDD code.. In this article we will introduce BDDs, show their mapping to Datalog, and present a concise, precise pointer analysis. We then discuss how to build upon the results of the pointer analysis to build new special-purpose analyses.

Figure 1: Uncompressed (left) and compressed (right) BDDs for the set 110, 111

Binary Decision Diagrams

Binary Decision Diagrams are a data structure for compactly representing sets of bit strings. The bit strings in a set are represented as paths through a diagram, stopping when a decision has been reached. Each internal node in the diagram represents a decision of the path taken according to the value of the bit. The leaf node is either 0 or 1 signifying whether the bit pattern is an element of the set. Common bit patterns among members of a set translate to a compact representation.

An example is shown in Figure 1. Both diagrams encode the set {110, 111}. The left diagram is "complete" - each bit leads to a new node, and once all bits are processed, a conclusion is reached. The right diagram simplifies this to only the necessary nodes, which compresses the representation to 4 nodes instead of 15.

BDDs and Datalog

Datalog's structure maps well to the concept of BDDs:
  • Each domain of values maps to a bitfield wide enough to contain all possible values for that domain.
  • Each value in a domain maps to a unique bit pattern within that domain.
  • Each relation is encoded as a bit field whose width is the sum of the widths of the domains of each parameter in the relation.
  • Each fact in a relation is the concatenation of the values of   each of its parameters.
  • Datalog rules specify facts in terms of previously existing  facts - these deductions translate directly to BDD operations.
Whaley and others have designed and developed a system called bddbddb that exploits this mapping generically [3]. This platform has proven useful for a wide variety of problems in program analysis.

Pointer Analysis with BDDs

For pointer analysis, a program can be abstracted into just a few relations:
  • assign(v1, v2) - This represents an instruction v1 = v2.
  • load(v1, f, v2) - This represents an instruction v2 = v1.f
  • store(v1, f, v2) - This represents an instruction v1.f = v2.
  • varInit(v,h) - This represents the initial pointer information, that a variable v has been assigned  a newly constructed heap object h.
Given these basic sets of facts, we can specify a complete pointer analysis very concisely. The Berndl points-to analysis [1], one of the first practical pointer analyses using BDDs, is expressible in Datalog in four lines. The relation varPoints(v,h) represents points-to information: that a
variable v may point to heap location h. Likewise, fieldPoints(h1, f, h2) represents that for an object h1, the
field h1.f may point to h2.

varPoints(v,h)  :-  varInit(v, h). 
varPoints(v1, h) :- assign(v1, v2), varPoints(v2, h).
fieldPoints(h1, f, h2) :- store(v1, f, v2), varPoints(v1, h1),
varPoints(v2, h2).
varPoints(v2, h2) :- load(v1, f, v2), varPoints(v1, h1),
fieldPoints(h1, f, h2).
In object-oriented systems such as Java, pointer information also influences the call graph. Whaley and Lam extend this analysis further to produce an invokes relation computed alongside the points-to sets [4]. The varPoints and invokes relations combined are sufficient to implement a wide range of additional analyses.

Program Analyses

One particularly compelling use of pointer analysis is ot automatically detect errors and security vulnerabilities in programs. Armed with accurate pointer information, many rules governing the correctness or security of programs can be readily expressed in Datalog.

As an example, consider the class of problems known as taint analysis. This class includes SQL injection and Cross-site scripting, two of the most common vulnerabilities in Web applications. In these problems, we seek to discover whether user input can flow to some sensitive place such as a system command (letting the user run any program on the server with the privileges of the service) or a database request (letting the user bypass security or alter data). A simple analysis for this is expressible in two lines:
badInput(h) :- returns(i, v), varPointsTo(v, h),
invokes(i, "getInput").
error(h) :- badInput(h), argument(i, v), varPointsTo(v, h),
invokes(i, "system").
This looks for calls to getInput and system where the return value of the first is the same as the argument to the second. This is somewhat naive, but this illustrates how analyses can be written in Datalog. Results of analyses can easily be used in more complex analyses.

Conclusion

Program analysis is largely a matter of iterative, systematic deduction based on facts about the target program. As such, rule-based systems such as Datalog can represent the core of an analysis easily. Furthermore, binary decision diagrams (BDDs) permit an efficient implementation of many analyses across sizable programs.

References

[1] M. Berndl, O. Lhotk, F. Qian, L. Hendren, and N. Umanee. Points-to analysis using BDDs. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, pages 103-114, 2003.

[2] Michael Hind. Pointer analysis: Haven't we solved this problem yet? In Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE'01), pages 54-61, 2001.

[3] John Whaley et al. The bddbddb analysis system. http://bddbddb.sourceforge.net/.

[4] John Whaley and Monica S. Lam. Cloning-based context-sensitive pointer alias analysis using Binary Decision Diagrams. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI), pages 131-144, 2004.